Introduction


Organization Development (OD) is a term used to describe a wide range of
social-science-based approaches to planned organizational change (Porras &
Berg, 1978a). OD is a planned, systematic process of organizational change
based on behavioral science technology, research, and theory (Beckhard, 1969;
Hellriegel et al., 1973; Herrington, 1976). The practice of OD is aimed
toward improving the quality of life for members of human systems and
increasing the institutional effectiveness of those systems (Alderfer, 1977;
Herrington, 1976). When an organization is functioning below its capacity,
the purpose of OD is to determine the ultimate causes of the undesirable
symptoms and then to devise ways to eliminate or at least minimize the
cause(s) (Armenakis, Feild, & Holley, 1976). Eliminating the causes of
undesirable symptoms, then, is the objective of each OD intervention.

Organization Development can be defined as a sustained, long-range process
of planned organizational change using reflexive, self-analytic methods of
improving the functioning of an organizational system (Bennis, 1969; Campbell,
Bownas, Peterson, & Dunnette, 1974; Cook, 1976; Miles & Schmuck, 1971), with
emphasis on improvement of an organization's problem-solving and renewal
processes with the assistance of a consultant or "change agent" (French & Bell,
1973). OD consultants help people do preventive maintenance on their
relationships, problem-solving abilities, and organizational structures,
policies, and procedures (Weisbord, 1981).


OD places more emphasis than do other approaches (e.g., management
development) on a collaborative process of data collection, diagnosis, and
action for arriving at solutions to problems (Burke & Schmidt, 1970; Cook,
1976; Hellriegel, 1973; Herrington, 1976). Improvement of a dysfunctional
organizational state implicitly involves standards or criteria for optimal
performance. Even though what is considered to be optimal will differ from
organization to organization, all OD efforts will be similar in attempting to
identify these goals, objectives, and criteria for optimal performance
(Campbell et al., 1974; Cook, 1976).

This process has been labelled as "action research" (Campbell et al.,
1974; French, 1982; Friedlander & Brown, 1974; Hellriegel, 1973; Nicholas,
1979; Weisbord, 1981) and underlies most of the interventions that have been
invented in the evolution of OD (French, 1982; Hellriegel, 1973). The focus
of action research has recently shifted from exploration, inquiry, and
discovery to deliberate alteration and improvement of organizational
structures through purposeful planning and systematic methodologies
(Weisbord, 1981).

The action research process involves problem identification, consultation,
data gathering, diagnosis, feedback to the client, and data gathering after
action (French, 1969).

The final evaluative stage collects data to monitor, measure, and determine
effects, which are fed back to clients for re-diagnosis and new action
(Nicholas, 1979). To evaluate whether or not stated objectives and goals
have been achieved, the presence or absence of causes and symptoms, i.e., the
criteria, must be determined during the evaluation phase of the OD
intervention (Armenakis et al., 1976). The design of OD research and the
measurement of effects from interventions can be viewed as a broader activity
called evaluation research (Alderfer, 1977; Burke & Schmidt, 1970). The
purpose of evaluation research is to measure the effects of a program or
intervention against the goals the program set out to accomplish, in order to
improve future programming (Weiss, 1972). This means that the role of the
researcher is to determine whether the changes in the system are the result
of the OD effort or the result of extraneous occurrences. To justify the
time and money expended in the OD intervention, as well as to allow for the
determination of the most effective technique of intervention (Franklin, 1976;
Nicholas, 1979; Schuman, 1967), the researcher must attempt to establish
cause and effect (De Meuse & Liebowitz, 1981) and to understand the
underlying processes contributing to the observed effect.

Lewin  (1946)  emphasized  the  role  of  evaluation  in  action  research  as 
follows: 


If we cannot judge whether an action has led forward or
backward, if we have no criteria for evaluating the relation
between effort and achievement, there is nothing to
prevent us from making the wrong conclusions and to
encourage the wrong work habits. Realistic fact-finding
and evaluation is a prerequisite for any learning (p. 35).


Johnson (1970) and Hawkridge (1970) have distinguished between summative
and formative evaluations. The primary purpose of a summative evaluation is
to reach an overall judgment of a program as it already exists. Formative
evaluations use data collected during the development and initial tryout
of a program as a basis for improving the program. Despite this distinction,
all evaluations more or less follow the procedural outline provided by
Hawkridge (1970). According to Hawkridge, the seven phases of evaluation
research are as follows: (1) setting up goals and objectives for the
evaluation, (2) selecting objectives to be measured, (3) choosing instruments
and procedures, (4) selecting samples for the intervention, (5)
establishing measurement and observation samples, (6) choosing analysis
techniques, and (7) drawing conclusions and recommendations. Each of these
steps will be discussed in detail in the following sections. Issues that
must be taken into consideration for each phase will be presented. Steps #4
and #5 will be considered together as part of the larger discussion on
research design and methodology. Analysis techniques (#6) will also be
covered in this design and methodology section.


The  Process  of  Evaluation 


Setting  Up  Goals  and  Objectives  for  the  Evaluation 

Evaluation of a program's effectiveness is not possible unless intended
impacts of the program are stated in clearly measurable terms (Margulies,
Wright, & Scholl, 1977). In planning for the evaluation of an organization
development intervention, it is important that specific goal selection be
accomplished to assure that data appropriate to measurement of selected goals
will be available or attainable (Hahn, 1970). A program objective or goal is
simply an intended impact of the program itself on some target population.

To specify an objective clearly, one must state the operations by which it
can be determined whether and to what extent the objectives have been
obtained (Johnson, 1970). These operations are then the measures that are
needed (Fitzpatrick, 1970; Carver, 1970). That is, if objectives are
precisely (usually behaviorally) stated (Campbell et al., 1974; Carver, 1970;
Franklin, 1976; Hahn, 1970; Johnson, 1970), the measurement problem is all
but solved (Carver, 1970).


Selecting Criteria to be Measured

According to Porras and Patterson (1979, p. 41), "perhaps the most
pressing issue in OD assessment research is the problem of which variable to
measure." They further state that "it is easy to advocate that the assessors
should measure the variables being affected by the intervention, but
frequently we do not know what these variables are ahead of time." Many
writers have emphasized the use of "hard" objective, behavioral measures
(Armenakis & Feild, 1976; Armenakis et al., 1975), but OD interventions
frequently also aim to change attitudes, sometimes referred to as "soft"
criteria.

Armenakis and Feild (1975) further distinguish between internal and
external hard criteria. The distinction lies in the degree to which the
criteria are influenced by changes occurring external to the organization.
Internal hard criteria are minimally influenced by changes occurring external
to the organization and are readily accepted by organizational members as
measures of organizational performance (e.g., productivity). External hard
criteria are those considered to be influenced by these external changes. For
example, Georgopoulos and Tannenbaum (1957) have noted that "net profit ... is
a poor criterion in view of the many unanticipated fluctuations external to
the system, e.g., fluctuations in the general economy, market sales, and
earnings" (p. 535).

Campbell et al. (1974) have identified several dependent variables that
could be assessed in the evaluation of organizational effectiveness.
Although initially suggested as effectiveness criteria, they can also be
applied as potential outcome criteria of a program intervention evaluation,
depending on the objectives of the intervention (Fitzpatrick, 1970). Many of
these can be classified as either "soft" or "hard" criterion measures. A
criterion will be considered "hard" if its measurement can potentially be
obtained through objective, preferably behavioral, indices. Consensual
agreement of attitudinal precepts will be considered to be "hard" within the
context of this definition since the determination of consensual agreement
should be relatively objective. "Soft" measures will be those involving
subjective, attitudinal ratings for variables having no easily identifiable
or observable criterion.

Hard  criteria  include: 

Productivity.  Productivity  refers  to  the  quantity  or  volume  of  the  major 
product  or  service  the  organization  provides. 

Efficiency. This could be represented as a ratio that reflects a comparison
of some aspect of unit performance to the costs incurred for that
performance.

Profit. Profit is the amount of revenue from sales left over after all
costs and obligations are met.

Accidents. This refers to the frequency of on-the-job accidents resulting
in lost time.

Growth. Growth refers to the increase in such things as manpower,
facilities, assets, and innovations.

Absenteeism.

Turnover. Turnover refers to any change of personnel within the
organization.

Control. Control refers to the degree and distribution of management
type of control that exists within an organization for influencing and
directing the behavior of organization members.

Goal Consensus. This refers to the degree to which all individuals
perceive the same goals for an organization.

Role and norm congruence. Role and norm congruence refers to the degree
to which the members of an organization are in planned agreement on such
things as what kinds of supervisory attitudes are best, performance
expectations, morale, role requirement, etc.

Managerial task skills. Refers to the overall level of skill the
commanding officer, managers, or group leaders possess for performing tasks
centered on work to be done, and not the skills employed when interacting
with the organizational members.

Soft  measures  include: 

Satisfaction. Satisfaction could be described as an individual's
perception of the degree to which he or she has received an equitable amount
of the outcome provided by the organization.

Morale. Morale is a predisposition in organizational members to put
forth extra effort in achieving organizational goals and objectives.

Readiness.  Readiness  is  an  overall  judgment  concerning  the  probability 
that  the  organization  could  specifically  perform  some  specified  task  if  asked 
to  do  so. 

Below are additional criteria listed by Campbell et al. (1974) that could
entail both "soft" and "hard" measurement procedures:

Cohesion/Conflict. Cohesion refers to the extent that organization
members like one another, work well together, communicate freely and openly,
and coordinate their work efforts. Conflict refers to verbal and physical
clashes, poor coordination, and ineffective communication.

Flexibility/Adaptation. This refers to the ability of an organization to
change its standard operating procedures in response to environmental changes.

Managerial/Interpersonal Skills. Refers to the level of skills and
efficiency with which management deals with supervisors, subordinates, and
peers, and includes the extent to which management gives support, facilitates
constructive interaction, and generates enthusiasm for meeting goals and
achieving excellent performance.

Information management and communication. Refers to the collection,
analysis, and distribution of information critical to organizational
effectiveness.


Choosing Instruments and Procedures

Related to the question of which variables are to be measured is the
question of how to go about measuring the variables (Gordon & Morse, 1975).
Many of the variables that have been identified have several different
operational forms. Existing records, direct observation, retrospective
ratings by independent observers, and self-perceptions have all been used as
sources for data (Franklin & Thrasher, 1976). Techniques include direct
observation, the use of tests and questionnaires, the use of physical evidence
data, and the use of archival records. According to Webb, Campbell, Schwartz,
and Sechrest (1966), each of these techniques must be viewed in terms of its
"obtrusiveness" or reactive effect on the subject or program participant.
Obtrusiveness refers to the subject's awareness that he or she is being
measured or observed, which in turn affects the behavior of interest. An
obtrusive measure alters the natural course of the behavior as it would have
occurred without the observation, i.e., a "guinea pig effect." Within this
context, self-report questionnaire data and direct observation are the most
obtrusive, while the use of archival measures and physical evidence are the
least obtrusive. These last two techniques minimize the need to disturb
subjects and lessen the extent to which the measurement process itself changes
the behavior of interest.

Many have advocated the use of unobtrusive measurement techniques in the
evaluation of OD interventions, but few have suggested specific procedures for
carrying this recommendation out (Cummings, Molloy, & Glen, 1977). Direct
observation is very costly and could be highly obtrusive, but does not rely
on a subject's retrospective account or his or her subjective impressions.
Archival records are unobtrusive, cheap to obtain, easy to sample, and the
population restrictions associated with them are often knowable. However,
Campbell (1969, p. 415) warns that "those who advance the use of archival
measures as social indicators must face up not only to their high degree of
chaotic error, but also to the politically motivated changes in record
keeping that follow upon their public use as social indicators."

Although behavioral indices of change are preferred to the less objective
measures, they are almost always more difficult and costly to obtain, hence
the continual reliance on self-report questionnaire instruments.
Questionnaires are relatively inexpensive and allow the collection of data
from a large sample simultaneously. In addition, they easily lend themselves
to statistical analysis, a feature lacking to a great degree with unobtrusive
physical accretion and erosion techniques. However, Pate, Nielsen, and Bacon
(1977) warn that "the exclusive use of questionnaire instruments in the
assessment of organizational change capitalizes on chance outcomes and does
not allow the researcher to obtain convergence on his or her results." They
also add that "such practice does not permit the researcher to adequately
handle the problem of response bias" (p. 454).

The use of questionnaires also brings up the psychometric considerations
of reliability and validity (Morrison, 1973). Reliability refers to
the consistency with which a measuring device yields identical results when
measuring identical phenomena. Validity is concerned with how well a measure
captures the essence of the phenomenon of interest. Issues of reliability
and validity must be seriously considered whenever a "tailor-made"
questionnaire is developed according to the needs of a specific organization
(Armenakis, Feild, & Holley, 1976). Often, demonstrating reliability and
validity takes longer than the time pressures of the OD intervention allow.
Gordon and Morse (1976) warn that "the lack of sensitive, validated, and
reliable measurement instruments limits current attempts at evaluation"
(p. 343). The requirements of a test instrument's reliability and validity
will frequently compromise "tailor-made" measurement devices. Several other
problems arise from the use of the questionnaire as the source and means of
data collection (Alderfer, 1977; Carver, 1970; Golembiewski, Billingsley, &
Yeager, 1976a; Pate et al., 1977). For example, Carver (1970) states that
principles that have been validly developed for measuring between-individual
differences are invalidly used for measuring within-individual change or
group differences. His conclusion rests on the contention that items excluded
in the construction of a measurement device to assess individual differences
are the very items that should be included in order to determine whether any
change took place. Because of the way such tests are constructed, the
relationship between the test score and the variable measured is not linear
in tests of individual differences. Thus, a difference or change detected at
one end of the scale may not reflect the same difference at another locale on
the scale.

Golembiewski et al. (1976a) proposed that the entire concept of change is
in need of clarification, particularly as it is accomplished through survey
and questionnaire techniques. According to Randolph (1982, p. 119), "a
unitary concept of change may be inappropriate and misleading, both in terms
of over-estimation and under-estimation of organizational change." Changes
from OD interventions may involve any one or all of the following three
conceptually distinct types of change as recently operationalized by
Golembiewski et al. (1976a): Alpha, Beta, and Gamma change. Terborg et al.
(1980, p. 111) state that "it is important to understand which type of change
has occurred if the effects of interventions are to be unambiguously
examined." Without an assessment of these change types, OD researchers may be
led to conclude that a situation is deteriorating or that no change has
occurred when in fact change has occurred (Alderfer, 1977; Armenakis, Feild,
& Holley, 1976; Golembiewski et al., 1976a; Lindell & Drexler, 1979; Macy &
Peterson, 1983; Porras & Patterson, 1979; Randolph, 1982).

Gamma change occurs when the subject, over time and as a result of the OD
intervention, changes his or her understanding of the criterion being
measured (Zmud & Armenakis, 1978). This type of change involves a redefinition
of concepts previously defined (Alderfer, 1977; Armenakis & Smith, 1978;
Porras & Patterson, 1979). For example, if a factor analysis of a
questionnaire indicates that several items measure a specific dimension, say
leadership, then a factor analysis of a data set obtained subsequently should
produce the same items measuring the same dimension, i.e., the factor
structures should be identical. However, it could be that the planned OD
intervention was directed or intended to enhance the subjects' understanding
of the concept of leadership. If subjects have redefined a criterion during a
change program, then questionnaire responses before the intervention may have
little resemblance to responses after intervention and a comparison of
responses would be meaningless and/or misleading (Armenakis & Smith, 1978;
Armenakis & Zmud, 1979).




Beta changes are changes in perceptions of a dimension as determined by a
measuring instrument in which scale intervals have varied over time. Beta
change can occur when respondents record a change even though no actual
behavior change has taken place. Suppose that a supervisor at a second
measurement is no more or less supportive than he or she was at the first
measurement. One still might find a change in the supervisor's score on the
scale if those who rated him or her changed the way they used the scale
(Lindell & Drexler, 1979). In using self-report questionnaires, researchers
assume that individuals using them in evaluating themselves or the situation
have an internalized standard for judging their level of functioning with
regard to a given dimension, and that this internalized standard will not
change from pretest to posttest. Researchers must be able to state what each
particular score on the pretest set of scores is equivalent to on the
posttest set of scores, i.e., a common metric must exist between the two sets
of scores (Cronbach & Furby, 1970). If the standard of measurement changes
between the pretest and posttest, the two ratings will reflect this
difference in addition to changes attributable to the experimental
manipulation (Howard, Schmeck, & Bray, 1979). Consequently, comparisons of
the ratings will be invalid. This threat to the internal validity of
evaluation design has also been referred to as "instrumentation" by Campbell
and Stanley (1966) and as "the response shift bias" by Howard and Dailey
(1979).

Alpha change is that change which is detected along a consistent
measurement scale (i.e., no beta change) and for which gamma change has been
ruled out (Alderfer, 1977; Armenakis & Zmud, 1979; Golembiewski &
Billingsley, 1980; Lindell & Drexler, 1979; Porras & Patterson, 1979; Zmud &
Armenakis, 1978). In other words, the phenomenon itself has changed, while
neither the subject's understanding of it nor the scale units have changed
(Armenakis & Zmud, 1979). Alpha change takes place when an actual behavioral
change is recorded as such by respondents. For example, alpha change occurs
when a respondent, on a leadership scale, indicates leader behavior as
changing from a "2" to a "3" when in fact the leader's behavior has changed
by that amount (Armenakis & Smith, 1978).

These prior explanations and illustrations allow for a fuller
understanding of the formal definitions of alpha, beta, and gamma change as
initially provided by Golembiewski et al. (1976a) and referred to by many
others (Golembiewski & Billingsley, 1980; Lindell & Drexler, 1979; Macy &
Peterson, 1983; Roberts & Porras, 1982; Porras & Patterson, 1979):

ALPHA CHANGE involves a variation in the level of some
existential state, given a constantly calibrated measuring
instrument related to a constant conceptual domain.


BETA  CHANGE  involves  a  variation  in  the  level  of  some 
existential  state,  complicated  by  the  fact  that  some 
intervals  of  the  measurement  continuum  associated  with 
a  conceptual  domain  have  been  recalibrated. 


GAMMA CHANGE involves a redefinition or reconceptualization
of some domain, a major change in the perspective
or frame of reference within which phenomena are perceived
and classified, in what is taken to be relevant in some
slice of reality (p. 134).

Differentiating alpha, beta, and gamma change is of special importance to
researchers because this typology is closely intertwined with the objectives
of behavioral interventions (Alderfer, 1977; Armenakis & Zmud, 1979; Zmud
& Armenakis, 1978). If the purpose is to improve leader behavior and to
reflect this improvement by measuring subordinate perceptions of leader
behavior, then alpha change may be intended (Armenakis & Zmud, 1979). On the
other hand, the purpose might be to change respondents' understanding of
leadership, and then gamma change may be intended. Golembiewski and
Billingsley (1980) state that "gamma change constitutes the goal of many
planned interventions" because OD seeks to change "the concepts of the
quality of organization life that should and can exist" (Golembiewski et al.,
1976a).

Alpha, beta, and gamma change may be caused by the sources of invalidity
and/or the OD effort (Armenakis & Smith, 1978). A true experimental research
design would help to determine whether the OD intervention caused the
observed changes (experiments and research designs will be discussed more
fully later). However, comparison group designs are often impossible in
organization development research. It is the absence of a comparison group
in combination with the use of questionnaires that makes it difficult to
determine the presence and degree of change that occurred as a result of the
organization development intervention. Many authors have addressed this
issue (Armenakis & Zmud, 1979; Golembiewski & Billingsley, 1980; Macy &
Peterson, 1983; Randolph, 1982; Terborg, Howard, & Maxwell, 1980),
suggesting a two-step process for determining the effects of an OD
intervention as a result of alpha change: 1) detect gamma change first, for
if it exists, beta and alpha change cannot be detected; 2) if it can be shown
that gamma change has not occurred, beta change must then be assessed, for if
it exists alpha change cannot be assessed. Only if gamma and beta change are
discounted can alpha change be assessed.
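
The ordering of these checks can be sketched in a few lines of Python. This is only an illustration of the sequence described above; the function name and its two boolean inputs are hypothetical, and the sketch assumes that the presence of gamma and beta change has already been judged by some other procedure.

```python
def interpretable_change(gamma_detected: bool, beta_detected: bool) -> str:
    """Return the change type that may be interpreted.

    Hypothetical helper: the checks run in the order gamma, beta, alpha,
    mirroring the two-step process described in the text.
    """
    if gamma_detected:
        # Step 1: a redefined construct rules out reading beta or alpha change.
        return "gamma"
    if beta_detected:
        # Step 2: a recalibrated scale rules out reading alpha change.
        return "beta"
    # Only when gamma and beta change are both discounted is alpha assessed.
    return "alpha"


if __name__ == "__main__":
    print(interpretable_change(gamma_detected=False, beta_detected=True))
```

The early returns make the dependency explicit: a later change type is never examined while an earlier one remains unresolved.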

Golembiewski et al. (1976b) suggest testing for differences in the
factorial structures of measures across time as an operational way to
determine whether gamma change took place. Zmud and Armenakis (1978) describe
the rationale (Ahmavaara, 1954) for using the procedure as follows:


Since gamma change involves a redefinition of criterion
being investigated, subject response structures (as
determined through factor analysis) that result from
each administration of the measurement device must be
compared (p. 666).


This comparison would examine the amount of common variance shared between
the pre- versus post-intervention structures (Golembiewski & Billingsley,
1980). A very high congruence between before and after structures signals
that no gamma change has occurred. Golembiewski and Billingsley (1980) have
set a cutoff of 50 percent common variance or less as indicating the
possibility of gamma change occurring, while Macy and Peterson (1983) state
that if the common variance is greater than 85 percent, it can be safely
concluded that any measured changes are not gamma changes.
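
A rough Python sketch of such a screen follows. It is not the transformation procedure used by the cited authors: as a simplifying assumption, common variance is approximated here by the squared Pearson correlation between corresponding factor loadings, and the factors are assumed to be already matched and aligned across administrations. The function names are hypothetical, and placing the 50 and 85 percent cutoffs in a single function is an illustrative choice, since each cutoff comes from a different source.

```python
from math import sqrt


def shared_variance(pre_loadings, post_loadings):
    """Squared Pearson correlation between two equally shaped loading matrices.

    Assumption: this stands in for the common variance between pre- and
    post-intervention factor structures; it is a proxy, not Ahmavaara's method.
    """
    x = [value for row in pre_loadings for value in row]
    y = [value for row in post_loadings for value in row]
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("loading matrices must have the same shape")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("loadings must vary for a correlation to be defined")
    r = sxy / sqrt(sxx * syy)
    return r * r


def gamma_screen(common_variance, low=0.50, high=0.85):
    """Apply the two cutoffs quoted in the text to a proportion in [0, 1]."""
    if common_variance <= low:
        return "possible gamma change"
    if common_variance > high:
        return "gamma change unlikely"
    return "indeterminate"


if __name__ == "__main__":
    # Made-up loadings for four items on two factors, for illustration only.
    pre = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.6]]
    post = [[0.7, 0.2], [0.8, 0.1], [0.2, 0.8], [0.1, 0.7]]
    print(gamma_screen(shared_variance(pre, post)))
```

Values falling between the two cutoffs are reported as indeterminate because neither source speaks to that range.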

Zmud and Armenakis (1978) have offered a methodology for assessing beta
change using questionnaires. They suggested that alpha and beta changes can be
differentiated when pre and post ratings are collected both on actual and
ideal criterion levels. Through comparison of actual scores, ideal scores, and
differences between actual and ideal scores, they maintain it is possible to
infer alpha or beta change, assuming no gamma change. If ideal scores have
changed, respondents have recalibrated the measurement scale (Randolph, 1982).
Examination of difference scores will clarify whether beta changes or both
alpha and beta changes have occurred.
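
The comparison of actual, ideal, and difference scores can be illustrated with a small Python sketch. The decision rule below is one possible reading of the approach, not the authors' published procedure: the function name, the fixed tolerance standing in for a significance test, and the interpretation of a shifted ideal-minus-actual gap as alpha change occurring alongside beta change are all assumptions made for illustration.

```python
from statistics import mean


def classify_alpha_beta(pre_actual, post_actual, pre_ideal, post_ideal,
                        tolerance=0.25):
    """Classify change from mean actual and ideal ratings, assuming no gamma change.

    `tolerance` is a hypothetical threshold; a real analysis would test the
    shifts statistically rather than compare them to a fixed number.
    """
    ideal_shift = mean(post_ideal) - mean(pre_ideal)
    actual_shift = mean(post_actual) - mean(pre_actual)
    gap_pre = mean(pre_ideal) - mean(pre_actual)
    gap_post = mean(post_ideal) - mean(post_actual)
    gap_shift = gap_post - gap_pre

    ideal_moved = abs(ideal_shift) > tolerance
    if ideal_moved and abs(gap_shift) > tolerance:
        # Scale recalibrated and the actual-ideal gap also moved.
        return "alpha and beta"
    if ideal_moved:
        # Scale recalibrated, but the gap to the ideal is unchanged.
        return "beta"
    if abs(actual_shift) > tolerance:
        # Stable ideal ratings: the shift in actual ratings is read as alpha.
        return "alpha"
    return "no change"


if __name__ == "__main__":
    # Made-up ratings on a 1-5 scale, for illustration only.
    print(classify_alpha_beta([2, 3, 2, 3], [3, 4, 3, 4],
                              [5, 5, 5, 5], [5, 5, 5, 5]))
```

The ideal ratings are checked first because a shift in them signals recalibration, which changes how any movement in the actual ratings should be read.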

If gamma and beta change are discounted, the next step is to assess for
alpha change. Terborg, Howard, and Maxwell (1980) suggest that this be done
by using t-test comparisons of mean differences between treatment and
comparison groups.
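
As a minimal sketch of such a comparison, the Python function below computes a pooled-variance independent-samples t statistic. The source does not specify which form of the t-test is intended, so the pooled (equal variance) form is an assumption, as are the function name and the use of post-minus-pre change scores as input.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(treatment, comparison):
    """Pooled-variance t statistic and degrees of freedom for two groups.

    Inputs are assumed to be change scores (post minus pre) for each member
    of the treatment group and the comparison group.
    """
    n1, n2 = len(treatment), len(comparison)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two scores")
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(treatment)
              + (n2 - 1) * variance(comparison)) / df
    standard_error = sqrt(pooled * (1 / n1 + 1 / n2))
    if standard_error == 0:
        raise ValueError("scores must vary for t to be defined")
    t = (mean(treatment) - mean(comparison)) / standard_error
    return t, df


if __name__ == "__main__":
    # Made-up change scores, for illustration only.
    t, df = independent_t([1, 2, 3], [0, 1, 2])
    print(round(t, 3), df)
```

The resulting statistic would then be compared with a critical value of the t distribution for the returned degrees of freedom.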

Lindell and Drexler (1979, p. 14) maintain that the importance of
Golembiewski's conceptual distinctions of change is "substantially
overstated." Their argument lies in asserting that changes in factor structure
can also be attributed to alpha and beta changes (therefore demonstrating the
insignificance of gamma change considerations), and that beta change will not
occur if "psychometrically sound" instruments are used, i.e., tests consisting
of reliable scales made up of multiple items with behavioral anchors.

Having dispensed with gamma and beta change, Lindell and Drexler (1979, p. 18)
argue that consideration of alpha change alone is sufficient, "since there is
little doubt that a psychometrically sound questionnaire needs to be
interpreted as anything other than face value."

In response to Lindell and Drexler's first point, Golembiewski and
Billingsley (1980, p. 101) state that "our critics have obviously missed the
point" regarding alpha change since "by our definition, alpha change implies
no appreciable change between pre and post intervention factorial structures."
In support of Golembiewski and Billingsley's argument, Randolph (1982) has
since demonstrated that gamma change can occur without alpha change. In
response to Lindell and Drexler's second point, Golembiewski and Billingsley
(1980) state their critics fail to recognize that the present state of OD
assessment technology does not meet the "psychometrically sound" criteria, nor
do they provide for a means of detecting beta change, given the likelihood of
its occurrence.

The time at which the post intervention measurement is taken is an issue
that also must be considered (Armenakis, Feild, & Holly, 1976). Measurements
taken immediately after an intervention may reflect more clearly specific
learnings from the program. On the other hand, delayed measuring may show
that effects which initially appeared to be strong have weakened or
disappeared. Porras (1977) found that the longer the time between the end of
the active intervention process and the last measurement of the research
variables, the fewer significant changes were reported. Morrison (1978, p. 43)
states that "practitioners hold that OD is an ongoing process and not a
time-bound intervention, and therefore traditional means of evaluation do not
apply."

In summary, "the measurement process needs much innovation and
development" (Porras & Patterson, 1979, p. 56). The measurement process stands at
the interface between the respondent's behaviors and attitudes and the
researcher's abstraction of those phenomena. Porras and Patterson (1979)
state that despite its critical role in this linkage process, "our abilities
to measure adequately are not receiving heavy emphasis or concentrated
development" (p. 56).

In the meantime, the use of multiple measures is advocated, providing
convergent evidence that an accurate assessment is being made concerning the
presence or absence of the variable of interest (Campbell et al., 1974;
Cummings et al., 1977; Fitzpatrick, 1970; Golembiewski et al., 1976; Pate et
al., 1977; Webb et al., 1966). As will be discussed more fully later, the
majority of OD interventions rely on self-report questionnaires in the data
collection phase of the evaluation. Pate et al. (1977) state that "the
exclusive use of questionnaire instruments in the assessment of organizational
change capitalizes on chance outcomes and does not enable the researcher to
obtain convergence of his or her results" (p. 467). Webb et al. (1966) state
that:


The mistaken belief in the operational definition of
theoretical terms has permitted social scientists a
complacent and self-defeating dependence upon single
classes of measurement, usually the interview or
questionnaire. Yet the operational implication of the
inevitable theoretical complexity of every measure is
exactly opposite: it calls for multiple operationalism,
that is, for multiple measures which are hypothesized
to share in the theoretically relevant components but
have different patterns of irrelevant components.


The advantage of using more than one mode of measurement is the opportunity to
determine the method variance in the measurement, thus providing a more
accurate determination of the variable's true value, and hopefully more insight
about the variable itself (Campbell et al., 1974). From this perspective, the
use of self-report attitude questionnaires is relevant, provided that they are
used as a sample of the total measurement universe (Fitzpatrick, 1970).

Research Design and Methodology

As stated earlier, the purpose of evaluation research is to measure the
effects of a program or intervention against the goals the intervention set
out to accomplish. The researcher must determine the presence of change as
well as establish a causal connection between the program intervention and the
subsequent effects. Plans for carrying these tasks out are referred to as
research designs. The following section will discuss the strengths and
weaknesses of some of the research designs frequently used by OD evaluators.

The ultimate test of the strength of any research design relates to its
internal and external validity (Armenakis et al., 1976; Campbell, 1969;
Campbell & Stanley, 1966; Cummings et al., 1977; Duncan, 1981; Evans, 1975;
Morrison, 1978; Posavac & Carey, 1980; Staw, 1980). The research design is
internally valid if it allows the researcher to eliminate alternative
explanations or rival hypotheses relative to the intervention and the outcome.
Campbell (1969) states that the mere possibility of some alternative
explanation is not enough; it is only the plausible rival hypotheses that are
invalidating. If one can confidently state that the intervention program
caused the observed effects, the design is internally valid. If the results
obtained can be accurately generalized to other subjects, situations, and
settings, the design is externally valid.


External validity asks whether the experiment's findings can be generalized
beyond the specific population, environment, and operational definitions of
the independent and dependent variables used in the study (Cummings et al.,
1977). Campbell and Stanley (1966) have identified four threats to external
validity:

Interaction effects of testing. This refers to the effects of a pretest
in modifying a subject's responsiveness to the program intervention, thus
threatening any generalization to an unpretested population.

Interaction effects of selection and treatment. The treated population
may be more responsive and hence unrepresentative of the universal population.

Reactive effects of the experimental arrangements. This refers to the
artificiality of the experimental setting which makes it atypical of settings
to which the treatment is to be regularly applied.

Multiple treatment interference. This refers to the interaction between
several different programs taking place simultaneously.

Threats  to  Internal  Validity 

Campbell and Stanley (1966) have also identified nine threats to internal
validity, also referred to as sources of invalidity. Since threats to
internal validity are the primary concern of program evaluators (Posavac &
Carey, 1980), future discussion of research designs and their attempts to
establish the effect of a planned organization development intervention will
focus exclusively on these internal threats. These threats can be understood
as possible research errors that can make the determination of cause and
effect difficult if not impossible (Duncan, 1981). The nine threats as
identified by Campbell and Stanley (1966) are as follows:


History. History refers to those events, in addition to the program
intervention, which occur between the first and second measures
of the dependent variable and thus provide an alternative explanation for
the changes observed. It is a change that affects the organizational unit but
is not related to the OD effort. Using some type of comparison group that
does not receive the program but is exposed to the same historical events
should control for this threat to internal validity.

Maturation. Maturation refers to changes within an organization as a unit
and/or its members as a function of the organization's or individual's own
natural development, that are independent of the OD effort and are operating
as a function of the passage of time. Many authors point out the lack of
long-term follow-up procedures in assessing the presence or absence of
an effect resulting from an OD intervention. Some conclude that, given the
state of the art of present OD evaluation methodology, maturation as a
possible source of internal invalidity is of little concern since the evaluation
assessments cover only a short period of time. If the individual is taken as
the unit of analysis, adults (versus children) employed by an organization
would be expected to have already attained a steady state of maturity. However,
this assertion can only be applied to physical maturation, and maturation as a
potential source of invalidity must not be so casually eliminated from
consideration. Campbell's (1963) "continued improvement" thesis demonstrates how
developmental maturation can occur at the organizational level. The thesis
states that any reliable organization is expected to improve its performance
naturally, by virtue of the organization's purpose to achieve a common goal
(Armenakis & Feild, 1975).

Instability. Instability refers to the unreliability of a measure. This
can apply to questionnaires that are necessarily imperfect measures of the
criterion and to human judges who may use inconsistent standards or grow
fatigued with an increasing number of observations of the criterion behavior.

Testing. Testing is defined simply as the effects of taking a test on the
scores of a subsequent test. Taking a pretest could sensitize a respondent
and subsequently influence his or her responses on the posttest. An example
of this would be the Hawthorne effect in which the subjects react to obtrusive
measurement techniques. The early Hawthorne studies demonstrate that when
intact work groups are singled out for special attention, changes in the
dependent variable may not be wholly attributable to changes in the independent
variable (White & Mitchell, 1976). Another example of testing would be when
participants in the OD effort attempt to respond to subsequent administrations
of the same or similar questionnaires differently because the first
administration made them sensitive to what was desired by the change agent.
Margulies et al. (1977) state that if the pretest is influential in focusing
attention on problem areas, it should be considered as part of the planned OD
intervention. A way to control for the effects of testing would be to use
nonreactive, unobtrusive measures (Webb et al., 1966) and to collect "hard"
data (Golembiewski et al., 1976).

Instrumentation. Instrumentation refers to changes in the calibration of
a measuring instrument or changes in the observer which result in changes in
the obtained measurements. As an example, the "new broom" that introduces
abrupt changes of policy is also apt to reform the record keeping procedures
and thus confound reform effects with instrument change (Campbell, 1969). As
another example, an organizational member exposed to a program designed to
improve team cohesiveness may indicate that his organization has changed from
a "2" to a "3" on a cohesiveness scale when in fact no change has occurred.
In each of these examples, the standard of measurement has changed, i.e., the
instrument has been recalibrated.

Statistical regression. This threat to internal validity refers to the
movement of an individual's extreme score toward the mean on a subsequent
administration of the assessment device. In the field of organization
development, subjects or groups are often selected for participation in the
intervention program because of a state of need as reflected in their extreme
scores on an assessment device (Campbell, 1969). Organizations seeking
services from OD consultation programs are often severely deficient in a desired
area or lacking in areas of standard performance. Groups scoring poorly on
the first administration of a test are likely to have as one component of
their low score an extreme error term that depresses their score. On a
subsequent administration of the test, the extreme conditions contributing to the
poor score are not likely to be present to the same degree that they were at
the initial administration. The absence of these depressing factors will
enhance the score; hence, the score regresses toward the overall mean (the
reverse logic applies in the instance of an extremely high score). A change
that is due to statistical regression may be confused with a change produced
by the intervention.
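
A small simulation can make the mechanism concrete. It is only a sketch under assumed values: each observed score is a stable true score plus independent random error, the lowest-scoring tenth on the first administration is "selected," and no intervention of any kind is applied. The means, standard deviations, and function name are all illustrative choices, not values from the literature reviewed here.

```python
import random

def simulate_regression(n=2000, frac=0.10, seed=0):
    """Return (first-test mean, retest mean) for the lowest scorers on test 1."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    # Two administrations share the true score but have independent errors.
    test1 = [t + rng.gauss(0, 10) for t in true_scores]
    test2 = [t + rng.gauss(0, 10) for t in true_scores]
    k = int(n * frac)
    # Select the units that look most "in need" on the first administration.
    lowest = sorted(range(n), key=lambda i: test1[i])[:k]
    mean1 = sum(test1[i] for i in lowest) / k
    mean2 = sum(test2[i] for i in lowest) / k
    return mean1, mean2

first, retest = simulate_regression()
print(f"selected group: first test {first:.1f}, retest {retest:.1f}")
```

The selected group's retest mean rises toward the overall mean even though nothing was done to it, which is the apparent "improvement" an evaluator could mistake for a program effect.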


Selection. Selection refers to biases resulting from differential
recruitment of comparison groups, producing different mean levels on the
measures of the effects. This source of internal invalidity often occurs
with the nonrandom assignment of subjects to treatment and comparison groups.
Any changes in the effectiveness of the organization could be explained by
the initial differences in relevant characteristics in the two groups. As a
result of their initial differences, the two groups may have differed on the
outcome criterion measure regardless of the OD intervention.

Experimental  mortality.  Mortality  refers  to  the  differential  loss  of 
respondents  from  the  groups  being  observed. 

Interaction effects. This threat to internal validity refers to the
instance when two or more of the above errors interact to confound the
results of a research design. The interaction effect most commonly referred
to as a threat to the internal validity of an experimental design is the
selection-maturation interaction, where differential rates of maturation or
autonomous change occur as a result of selection bias.

In addition to the threats enumerated by Campbell & Stanley (1966),
another group of possible confounds exists that will be referred to as
"relationship effects." These effects refer to the relationship (usually
unconscious) between consultant and participant that serves to enhance the
likelihood of a positive outcome. For example, if organizational members
know and respect the change agent, a halo effect could occur, causing subjects
to supply the desired results regardless of the intervention employed. The
potential for halo effects provides a strong argument for the use of external
consultants and evaluators, i.e., those individuals not directly involved
with the organization receiving the treatment. Other possible relationship
effects include placebo, Pygmalion, and experimenter demand effects.

Considerable attention has been devoted to the development of research
designs which can be applied to assess whether an observed change can be
attributed to the OD intervention (Campbell & Stanley, 1966; Cook & Campbell,
1979). These research designs can be evaluated in terms of the degree to
which they control for the various sources of internal invalidity (Howard et
al., 1979; Margulies, 1977). The more precisely a research design controls
for these errors, the more adequate it becomes. In a totally artificial
laboratory situation most of the errors can at least be measured and their
influence on the criterion of interest can be considered. However, as the
laboratory environment is removed, problems begin to develop in controlling
the sources of invalidity. The further removed from the laboratory, the less
rigorous the evaluation methodology becomes. Terpstra (1981) has found that
the number of positive evaluations of organization development interventions
increases as the methodological rigor of the designs decreases. Bass (1983)
suggests that research outcomes in the less rigorous designs can just as
easily be attributed to investigator bias, and to placebo, Hawthorne, and
Pygmalion effects on the participants. Gordon and Morse (1975) have
similarly concluded that the more rigorous designs are less likely to produce
positive results. In conclusion, positive results obtained from the more
rigorous research designs are more likely to indicate true intervention
effects. Therefore, program evaluators must attempt to employ more rigorous
research methodology (Armenakis et al., 1975; Cummings et al., 1977; Macy &
Peterson, 1983; Margulies et al., 1977; Pate et al., 1977; Porras & Berg,
1978a; Porras & Patterson, 1979; Randolph, 1982; Terpstra, 1981; White &
Mitchell, 1976).

What follows is a discussion of the research methodologies or
experimental designs available to the program evaluator. First there will be a
presentation of the ideal situation (the "true" experiment) followed by a
discussion of some of the research designs that are open to many rival
hypotheses. Finally, a discussion of some compromising designs, i.e., those
designs that are not "true" experiments but do control for many of the
threats to internal validity, will be presented.

Research Designs


True experimental design. True experimental designs are thought to control
for all threats to internal validity (Bentler & Woodward, 1979; Campbell &
Stanley, 1966; Franklin, 1976; Staw, 1980). However, Cook and Campbell
(1979) state that experimentation does not control for these threats to
internal validity: imitation, compensatory equalization, and compensatory
rivalry. These threats occur because of the difficulty in truly separating a
control and an experimental or treatment group in an organizational setting.
For example, if a supervisor finds out about an intervention occurring in
another work group, he or she may behave differently in order to compensate,
thus preventing a true control.

In the classic experimental design (also called the pretest-posttest
control group design) one first establishes the independent and dependent
variables of interest and decides how they are to be measured or varied.
Subjects are then chosen randomly from some larger and defined population and
assigned by random means to two (or more) subgroups. Different "treatments"
representing different aspects of one or more of the independent variables
are then applied to the various groups while one or more groups remain
"untreated," i.e., they serve as controls for the experimental procedure.

The effect of the exposure to the program is determined by comparing any
changes in those exposed to the treatment with changes in those not exposed
(Weiss, 1972). Campbell et al. (1974) state that "judiciously timed
measurement of the dependent variable across the several groups and analysis
of differences among the measurements yield inferences about the causal
effects of different levels of the independent variable on the dependent
variable" (p. 174).

The strengths of this design are achieved by randomization and the use of
a control group. Randomization prevents systematic differences in the
initial status of the experimental and control groups. A substitute
procedure commonly used when randomization is not feasible is matching subjects on
relevant characteristics. However, Weiss (1972) states that program
evaluators are often unable to define the characteristics on which people
should be matched. Cook and Campbell (1979) also warn that matching as a
substitute for randomization can result in regression effects. For example,
a group that is lacking in some desirable characteristic might seek an
intervention. A pretest is given in order to determine the group's level on
this desirable characteristic. The comparison group in most cases will not
be lacking in the characteristic of interest to the same degree that the
treatment group is (if the comparison group was deficient to a similar degree
it too would have sought the intervention). Thus, individual scores from the
comparison group should be relatively higher than individual scores from the
treatment group. If the program evaluator decides to match subjects on the
basis of their pretest scores, the matched subjects will represent different
ends of the distribution of their respective group. A relatively low score in
the comparison group would be matched with a relatively high score from the
treatment group. Due to statistical regression, the low scores from the
comparison group will regress toward the mean of the comparison group on a
subsequent posttest. In a similar fashion, the high scores from the treatment
group will regress toward the mean of the treatment group on a subsequent
posttest. Two scores that were once equal are now drastically different on
the basis of statistical regression alone. This difference or change is
commonly mistaken for an effect of the OD intervention.
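
The matching artifact described above can be sketched in a simulation. The setup is entirely assumed for illustration: two populations with different true means (40 for the group seeking the intervention, 60 for the comparison group), equal true-score and error spread, no treatment effect whatsoever, and "matching" approximated by keeping only members of each group whose pretest falls in the same band. The function name and all numbers are hypothetical.

```python
import random

def matched_pair_gap(n=5000, band=(45.0, 55.0), seed=1):
    """Return (pretest gap, posttest gap) between matched comparison and
    treated subjects when no treatment effect exists."""
    rng = random.Random(seed)

    def matched_members(true_mean):
        kept = []
        for _ in range(n):
            true_score = rng.gauss(true_mean, 10)
            pre = true_score + rng.gauss(0, 10)    # pretest with error
            post = true_score + rng.gauss(0, 10)   # posttest, no treatment effect
            if band[0] <= pre <= band[1]:          # crude matching on pretest
                kept.append((pre, post))
        return kept

    def average(pairs, index):
        return sum(p[index] for p in pairs) / len(pairs)

    treated = matched_members(40.0)      # high scorers within a low-mean group
    comparison = matched_members(60.0)   # low scorers within a high-mean group
    pre_gap = average(comparison, 0) - average(treated, 0)
    post_gap = average(comparison, 1) - average(treated, 1)
    return pre_gap, post_gap

pre_gap, post_gap = matched_pair_gap()
print(f"pretest gap {pre_gap:.1f}, posttest gap {post_gap:.1f}")
```

The matched groups start out nearly equal on the pretest, yet a sizable gap opens on the posttest purely because each subgroup drifts back toward its own population mean.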

Complete randomization of subjects to groups provides the tremendous
advantage of assuming that the groups so assigned do not significantly differ
from one another prior to the intervention (Fuqua, 1979). Thus by randomly
assigning subjects to experimental and control groups, any differences between
these two groups observed after the experimental group has been exposed to an
intervention which were not observed during the pretest can be attributed to
the effects of the intervention. In fact, in a truly randomized situation,
there is no necessity to show that the groups were equivalent through the use
of a pretest (Campbell & Stanley, 1966). Assuming that randomization to
groups ensures similarity, many have argued against the use of the pretest
(Campbell, 1957; Linn & Slinde, 1977). These authors state that often the act
of an initial observation itself (as in a pretest) is reactive (i.e., a source
of internal invalidity defined previously as testing). Adding pretests to the
design also weakens its validity because the pretest may have interacted
with the actual program to cause the observed change. This gives rise to the
posttest only control group design, to be used if randomness is assured.

Methods of analyzing experiments include: 1) a t-test to test the
significance of the difference between mean scores for the treatment and
control groups; 2) a simple ANOVA to simultaneously compare the means of
three or four groups to learn whether at least one of them is different from
the other means; 3) complex ANOVA to study the effects of more than one
factor simultaneously; 4) if a pretest is given, an ANCOVA could be used
using the pretest score as the covariate.
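
To illustrate the second method in the list above, the following sketch computes the F ratio for a simple (one-way) ANOVA from raw group scores. It is a textbook computation rather than any cited author's procedure, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """F ratio for a one-way ANOVA: between-group mean square divided by
    within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

With exactly two groups this F equals the square of the pooled-variance t statistic, which is why the t-test can be viewed as the two-group special case.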

Although true experimentation ranks highest in terms of providing valid
causal inference, it is not always the most practical course of action in
organizational settings. Only in rare instances are evaluators able to
exercise the amount of control required for experimental designs (Franklin,
1976). The experimental design is exceedingly difficult to apply in actual
field settings because of the experimental requirements of randomization and
control group use and the many other unplanned events and interventions
occurring differentially across groups (Campbell et al., 1974; Evans, 1975).
Individuals cannot always be assigned to experimental and control groups
because such a procedure might disrupt normal population systems or produce
inequities between experimental and control groups and hence be considered
unethical.

Thus, difficulties associated with true experimental designs limit their
usefulness in organization development evaluations. Given the practical
problems inherent in experimental designs, a number of alternatives to the
experimental design have been suggested (Campbell & Stanley, 1966; Cook &
Campbell, 1979).

A simple and commonly used design is the one group pretest/posttest
design. In this design, observations are made before and after an
intervention is introduced to a single group. This design can indicate whether
any change has taken place, but is not rigorous enough to allow the
assessment of the intervention's causal connection to the observed changes. The
design is open to many potential rival hypotheses, including history,
maturation, testing, instrumentation, mortality, and statistical regression.

Quasi-experimental designs. Because of the limitations of the one group
pretest/posttest designs and the impracticality of true experiments,
quasi-experimental designs are frequently employed (Campbell & Stanley, 1966;
Duncan, 1981; Friedlander & Brown, 1974). According to Weiss (1972),
quasi-experimental designs have the overriding feature of feasibility and can
produce results that are sufficiently convincing of an intervention's causal
connection with observed changes. Unlike true experiments designed to rule
out the effects of influences other than exposure to the program,
quasi-experimental designs often depend on the possibility that these influences can
be ruled out by statistical techniques (Linn & Slinde, 1977). Instead of
randomly assigning subjects to groups, quasi-experimental designs utilize
intact groups that are likely to be different or "nonequivalent" on many
variables.

One of the most popular quasi-experimental designs is the time series
experiment (Armenakis et al., 1976; Armenakis & Smith, 1978; Campbell &
Stanley, 1966; Cook & Campbell, 1979; Franklin, 1976; Weiss, 1972). The
essence of the time series design is the presence of a periodic measurement
process on a single group that acts as its own control both prior to and
after the introduction of an intervention. The effect of the intervention is
indicated by a discontinuity in the measurements recorded in the time series.
This design does not account for the potential confound of history, i.e.,
some other event besides the intervention could account for the observed
discontinuity. History could be controlled if a comparison group is
employed. Maturation is more or less controlled for if the time series is
extended. It is not likely for a maturation change to occur between
measurements in the time series after the intervention that did not occur
before the intervention. In a similar way instrumentation can be accounted
for. Selection and mortality are ruled out if the same specific persons are
involved at all observations. Regression effects are usually a negatively
accelerated function of elapsed time (Campbell, 1969) and are therefore
implausible as explanations of an effect after the intervention that is
greater than the effects between pretest observations.

As many pretest and posttest measures of the evaluation criteria should
be made as possible. Simple comparisons of one or two pretest scores with
one or two posttest scores may be influenced by extremes and therefore be
misleading. Armenakis and Smith (1978) recognize that the use of many
measurements is necessary in order to eliminate with confidence many of the
threats to internal validity and to assess the immediate and extended impact
of the intervention. However, the usefulness of this design is limited if
repeated measures are made using the questionnaire approach solely. The
effects of testing as a source of internal invalidity would be compounded as
the number of repeated observations increases. Respondents may be sensitized
to the nature of the changes to be expected if the same assessment device is
repeatedly used. If, to decrease this possibility, the between observation
time intervals are lengthened, ruling out the effects of history becomes even
more difficult (Franklin, 1976). In order to avoid the problems associated
with reactivity to a series of questionnaire measurements, Macy and Peterson
(1983) argue for the use of archival and behavioral data in the manner
outlined by Webb et al. (1966). Armenakis and Smith (1978) and Terborg et al.
(1980) advocate the use of a reduced number of observations.

Statistics for assessing change in time series designs must account for
the fact that data collected in an organizational setting are often not
independent; i.e., adjacent measures in the series have a higher correlation than
non-adjacent points (Armenakis & Feild, 1975). This phenomenon is referred
to as autocorrelation (Campbell, 1963; Macy & Peterson, 1983; Tryon, 1982)
and has been discussed by Cronbach and Furby (1970).
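
The dependence between adjacent observations can be checked directly. The sketch below is illustrative only and is not taken from any of the cited authors: it computes the conventional lag-1 autocorrelation estimate, and the function name and the sample data are assumptions made for the example.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: covariance of adjacent points over total variation."""
    n = len(series)
    mean = sum(series) / n
    # Cross-products of each observation with its immediate successor.
    adjacent = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    # Total sum of squared deviations from the series mean.
    total = sum((x - mean) ** 2 for x in series)
    return adjacent / total


# Hypothetical monthly scores that drift steadily upward.
drifting_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(lag1_autocorrelation(drifting_scores))
```

A value near zero is consistent with independent observations. A clearly positive value, as in the drifting series above, indicates that significance tests which assume independence should be interpreted with caution.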

Armenakis  and  Feild  (1975)  use  a  regression  technique  to  determine  the 
significance  of  the  difference  between  pretest  and  posttest  measurements.  A 
trend line is calculated for the data before the intervention, another for the
data after the intervention, and a third for the entire research period.
Variances from these trend lines are then calculated and an F ratio is
produced. From these, it is determined whether there were any statistically
significant differences between pretest and posttest performance.
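
The three trend lines and the resulting F ratio can be sketched as follows. This is a minimal illustration in the form of a conventional test for a structural break in a regression line, which is assumed here to resemble the procedure described. It is not a reproduction of Armenakis and Feild's (1975) exact computations, and the function names and data are hypothetical.

```python
def fit_line(t, y):
    """Least squares fit of y = a + b*t; returns (a, b, residual sum of squares)."""
    n = len(t)
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    sxx = sum((ti - mean_t) ** 2 for ti in t)
    sxy = sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, y))
    b = sxy / sxx
    a = mean_y - b * mean_t
    rss = sum((yi - (a + b * ti)) ** 2 for ti, yi in zip(t, y))
    return a, b, rss


def trend_change_f(pre, post):
    """F ratio comparing one overall trend line with separate pre and post lines."""
    t_pre = list(range(len(pre)))
    t_post = list(range(len(pre), len(pre) + len(post)))
    _, _, rss_pre = fit_line(t_pre, pre)
    _, _, rss_post = fit_line(t_post, post)
    _, _, rss_all = fit_line(t_pre + t_post, list(pre) + list(post))
    k = 2  # parameters per trend line (intercept and slope)
    df_den = len(pre) + len(post) - 2 * k
    # Reduction in residual variation from fitting two lines instead of one,
    # relative to the residual variation left around the two separate lines.
    f = ((rss_all - rss_pre - rss_post) / k) / ((rss_pre + rss_post) / df_den)
    return f, k, df_den


# Hypothetical criterion scores for six observations before and six after.
pre_scores = [10, 11, 10, 12, 11, 12]
post_scores = [18, 19, 18, 20, 19, 21]
f_ratio, df_num, df_den = trend_change_f(pre_scores, post_scores)
print(f_ratio, df_num, df_den)
```

The F ratio is referred to an F distribution with the returned degrees of freedom. Because the test assumes independent residuals, autocorrelated series can inflate it.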

Tryon (1982) uses the C statistic to determine whether the time series
contains any trends, i.e., systematic departures from random variation. The
logic underlying the C statistic is the same as the logic underlying visual
analysis; variability in successive data points is evaluated relative to
changes in slope from one phase of the time series to another. The C
statistic helps the evaluator assess how large the squared deviations
from the mean are (which reflect the presence of all types of trends)
relative to the sum of the squared consecutive differences (which are independent
of all types of trends). The logic of this fraction is analogous to that of
the F statistic.
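
As an illustration, the C statistic can be computed in a few lines. The sketch assumes the usual published form of the statistic, C = 1 - (sum of squared consecutive differences) / (2 x sum of squared deviations from the mean), with standard error sqrt((n - 2) / ((n - 1)(n + 1))). The function name and data are hypothetical.

```python
import math


def c_statistic(series):
    """Return (C, z) for a time series, where z = C divided by its standard error."""
    n = len(series)
    mean = sum(series) / n
    # Squared deviations from the mean reflect all types of trends.
    ss_deviation = sum((x - mean) ** 2 for x in series)
    # Squared consecutive differences are independent of trends.
    ss_consecutive = sum((series[i] - series[i + 1]) ** 2 for i in range(n - 1))
    c = 1.0 - ss_consecutive / (2.0 * ss_deviation)
    standard_error = math.sqrt((n - 2) / ((n - 1) * (n + 1)))
    return c, c / standard_error


# Hypothetical series with a steady upward trend.
c_value, z_value = c_statistic([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(c_value, z_value)
```

The ratio z is compared with a critical value from the normal distribution. A large positive z suggests that the series contains a trend rather than random variation alone.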

Another common quasi-experimental design is the nonequivalent comparison
group design (Campbell & Stanley, 1966; Evans, 1975; Franklin, 1976; Fuqua,
1979). This design is similar to the one group pretest/posttest design but
differs in that it employs a comparison group. The term "nonequivalent"
arises from the fact that subjects are not randomly assigned to the program
or comparison groups. Instead, in this design groups represent intact units.
Consequently, this design presents the potential for treatment and comparison
groups which differ significantly from one another before the intervention.
The more similar the intervention and comparison groups are in their
recruitment, and the more this similarity is confirmed by pretest scores, the
more effective this design is in controlling the sources of invalidity.
Including comparison groups permits a distinction to be made between the
effects of the program and the several alternate plausible interpretations of
change. Both treatment and comparison groups will have had the same amount
of time to mature, historical events will have affected both equally, testing
effects will be the same since both groups were tested twice, and mortality
can be examined equally for both groups. The main problem of the
nonequivalent control group design is the difficulty of selecting a comparison
group sufficiently similar to the intervention group. For example, people
choosing to enter a program are likely to be different from those who do not,
and the prior differences might make post-intervention comparisons tenuous.
Nonequivalent control group designs are especially sensitive to regression
effects when the treatment group has been selected on the basis of an extreme
score on a pretest (Evans, 1975).

The problem presented by the nonequivalent control group design basically
consists of eliminating group differences which exist at pre-intervention
assessment from the analysis of group differences at post-intervention
assessment. Reichardt (1979) provides a concise review of the literature which
proposes analytic techniques for use with nonequivalent control group designs.
The literature indicates that analysis of covariance (ANCOVA) procedures have
received the most attention. ANCOVA is a statistical procedure for
eliminating the effects of extraneous sources of variance from dependent
measures, properly used only when it is not possible to use experimental
controls to achieve the same result.

Although ANCOVA provides an attractive method for analyzing data from the
nonequivalent control group design, it has not proved wholly adequate when
used for this purpose. For example, it has been demonstrated that measurement
error can have a biasing effect on the analysis (Campbell & Erlebacher, 1970)
and that under some conditions the analysis may either underadjust or
overadjust for selection differences (Cronbach, Rogosa, Price, & Folden, 1976).

At present there is no single method for analyzing data from the nonequivalent
control group design that will be free of bias in all situations (Fuqua,
1979). Given the current state of analytic technology, Reichardt (1979)
suggests that multiple analytic techniques be employed.
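
To make the ANCOVA adjustment concrete, the sketch below estimates a treatment effect on posttest scores after adjusting for pretest scores, using ordinary least squares with a group indicator. It is one simple formulation, assumed for illustration and not drawn from the cited sources; the function names and data are hypothetical. As the studies cited above caution, the adjusted estimate can still be biased by measurement error in the pretest and by selection differences.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def ancova_group_effect(pre, post, group):
    """Fit post = a + effect*group + slope*pre; return (effect, slope)."""
    rows = [[1.0, float(g), float(p)] for g, p in zip(group, pre)]
    _, effect, slope = ols(rows, post)
    return effect, slope


# Hypothetical intact groups: the treatment group (1) starts with higher pretest scores.
group = [0, 0, 0, 0, 1, 1, 1, 1]
pre = [10, 12, 14, 16, 14, 16, 18, 20]
post = [2 + 0.5 * p + 3 * g for p, g in zip(pre, group)]
effect, slope = ancova_group_effect(pre, post, group)
print(effect, slope)
```

In this constructed example the unadjusted difference between posttest means is 5 points, while the adjusted effect is 3 points; the remaining 2 points reflect the pretest difference between the groups.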

The  combination  of  the  time  series  design  and  the  nonequivalent  control 
group  design  yields  a  design  that  is  more  rigorous  than  either  one  by  itself. 
This combination design has been given various names: the multiple time
series design, the control series design, the modified time series design.
The combination design is similar to the time series design but is different
in that a comparison group is used. This added feature provides for a design
that rules out all of the threats to internal validity (Campbell & Stanley,
1966; Franklin, 1976). The multiple measurements before the program is
implemented will point out any differences existing between the two groups,
facilitating the interpretation of any effects from the intervention. A
variant of this design is the interrupted time series with switching
replications (Cook & Campbell, 1979). This again is a time series design with
a comparison group. In this design the comparison group receives the same
intervention as the treatment group but at a later time.

Porras and Wilkins (1980) used a pooled regression approach to test for
statistical differences between treatment and comparison group measures taken
at multiple points in time. Treatment and comparison group data were pooled
to calculate the coefficients for one overall regression line. A "dummy
variable" was then added which permitted the slopes of the two groups to be
different. A third line was then estimated which permitted both slopes and
intercepts to be different. The amount of variance explained by each of
these resultant lines was represented by an R² for each regression. By
comparing the R² for each of the three equations, a test was made to
determine if letting the slopes or intercepts be different would give a better
fit and thus explain a greater proportion of variance. If the third equation
explained more of the variance than the first (pooled) one, then it could be
concluded that the treatment group was performing differently than the
comparison group. If the third equation showed a higher R² than the second,
then the intercepts were different. If the second equation had a higher R²
than the first, the slopes were different. Differences in intercepts and
slopes were indicative of changes in behavior over time.
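
The three-equation comparison can be sketched as follows. The code is a hedged illustration of the general approach described above and does not reproduce Porras and Wilkins' (1980) actual computations. The dummy coding (1 for treatment, 0 for comparison), the incremental F formula for nested regressions, and all names and data are assumptions made for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(rows, y):
    """Proportion of variance in y explained by a least squares fit on rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - sum(c * x for c, x in zip(coef, r))) ** 2
                   for r, yi in zip(rows, y))
    return 1.0 - ss_resid / ss_total


def three_model_r2(t, y, group):
    """R² for (1) one pooled line, (2) separate slopes, (3) separate slopes and intercepts."""
    pooled = [[1.0, ti] for ti in t]
    slopes = [[1.0, ti, g * ti] for ti, g in zip(t, group)]
    both = [[1.0, ti, g * ti, float(g)] for ti, g in zip(t, group)]
    return r_squared(pooled, y), r_squared(slopes, y), r_squared(both, y)


def incremental_f(r2_reduced, r2_full, n, k_full, q):
    """F ratio for the gain in R² from adding q terms to reach k_full parameters."""
    return ((r2_full - r2_reduced) / q) / ((1.0 - r2_full) / (n - k_full))


# Hypothetical data: six observations per group; the treatment group rises faster.
t = [0, 1, 2, 3, 4, 5] * 2
group = [0] * 6 + [1] * 6
y = [10.0, 10.2, 10.1, 10.4, 10.3, 10.6, 12.0, 13.1, 13.9, 15.2, 16.0, 17.1]
r2_pooled, r2_slopes, r2_both = three_model_r2(t, y, group)
print(r2_pooled, r2_slopes, r2_both)
print(incremental_f(r2_pooled, r2_both, n=len(y), k_full=4, q=2))
```

Because the three equations are nested, R² cannot decrease from one to the next. The incremental F ratio indicates whether a given increase is larger than would be expected by chance.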


The State of the Art of OD Intervention Evaluations

With the essential ingredients of a proper evaluation described, the
remainder of this literature review addresses whether and to what extent
organization development research has dealt with these issues.
Porras and Berg (1978) state that relatively little OD evaluation research
has been done, while others describe the current state of the art of OD
evaluations as underdeveloped (Friedlander & Brown, 1974; Margulies et al.,
1977; Morrison, 1978; Pate et al., 1977). Part of the reason for this is
that the field of organization development is a relatively new application of
the behavioral sciences. Research methodology best suited for this new field
is still in its developmental stages.

Administrative  and  methodological  considerations  also  contribute  to  the 
systematic avoidance of proper OD intervention evaluations. On the
administrative level, a divergence exists between the research-theory
perspective of an evaluator and the action-change perspective of management.
Management personnel involved in the planning and implementation of OD
interventions are primarily concerned with answers to immediate problems
(Pate et al., 1977). The pragmatic emphasis of "getting something useful"
from the OD effort often places research in a secondary priority. Others
hesitate to implement evaluation of existing programs for fear that the
evaluation process would interfere with or change the process of development
and change already taking place (Margulies et al., 1977).

On the methodological level, resistance arises out of the pessimism that
exists in trying to implement a rigorous research design able to control for
all sources of internal invalidity (Morrison, 1978). For example,
organizational realities prevent random assignment to control and treatment
groups (Armenakis et al., 1976; Macy & Peterson, 1983; Porras & Patterson,
1979). Field conditions of organization development research also prevent the
full control of such extraneous variables and influences as the varying
degrees of the intervention's implementation, multiple interventions taking
place at once, and the time when post-intervention criterion measurements are
taken.

OD research also suffers from what is described as a "criterion deficiency
problem" (Armenakis et al., 1976; Porras & Berg, 1978b). Most standardized
instruments used in OD evaluation research were not designed specifically to
measure those variables and criteria that are frequently targeted in OD
interventions. Thus, OD evaluators face a deficiency in the availability of
measures for the criteria of interest.

Despite the difficulties mentioned above, OD evaluations have been
attempted. Recent literature reviews by Armenakis et al. (1975), Cummings et
al. (1977), De Meuse and Liebowitz (1981), Pate et al. (1977), Porras and
Berg (1978a), Terpstra (1982), and White and Mitchell (1976) frequently
report similar results but arrive at markedly different conclusions.
For example, Porras and Berg (1978a) state that there exists a reasonably
large number of "scientific" investigations of the effects of OD programs.
Two years later Porras and Wilkins (1980), using the same studies reviewed by
Porras and Berg (1978a), conclude that there has been a slow rate of
development of OD assessment and research methods.

In  addition  to  arriving  at  different  conclusions  in  the  face  of  similar 
results,  authors  also  reach  different  conclusions  based  on  different  results. 
For example, White and Mitchell (1976) conclude that most OD research uses
poor research design while Porras and Berg (1978b) state that there is a
large number of OD studies using research designs possessing a high degree of
scientific rigor. A review of the OD literature thus does not consistently
provide an unequivocal assessment of the "state of the art" of OD
intervention evaluations.

A possible explanation for these diverse conclusions might be found in
what could be described as "the floating criterion" phenomenon. Across the
various literature reviews, authors include studies only if they meet certain
predetermined selection criteria. These selection criteria vary from author
to author, depending on a particular author's personal interpretation of what
an OD evaluation should consist of. Thus, the criteria for selection (and the
subsequent results and conclusions) "float" or vary from author to author.

As an example of the occurrence of the "floating criterion," Porras and
Berg's (1978a) selection criteria will be examined. Included in their sample
of OD intervention evaluations were only those studies which: 1) used
"human-processual" interventions; 2) were done in "representative segments" of
real-life organizations; 3) measured organizationally relevant process
variables; and 4) used quantitative techniques. Out of 160 evaluations
surveyed, only 35 met these criteria. It is on these 35 studies that Porras
and Berg (1978a) base their conclusion that the current OD evaluations are
adequately rigorous in their research methodology. Had the remaining 125
studies been included, conclusions more similar to White and Mitchell's (1976)
might have been reached.

In order for a clearer picture of the current status of OD evaluations to
be drawn, there must be agreed-upon standards of what a proper OD evaluation
consists of. The procedural outline provided by Hawkridge (1970) and used in
this paper suggests the necessary ingredients of the proper evaluation. The
literature reviews previously done by Armenakis et al. (1975), Cummings et al.
(1977), De Meuse and Liebowitz (1981), Pate et al. (1977), Porras and Berg
(1978a), Terpstra (1982), and White and Mitchell (1976) will be examined below
from the perspective outlined by Hawkridge (1970), i.e., the type of criteria
used, the type of measures used, and the type of research design used will be
examined.

Criteria Measured


Most authors (Cummings et al., 1977; De Meuse & Liebowitz, 1981; Pate et
al., 1977; Porras & Berg, 1978a; Terpstra, 1982; White & Mitchell, 1976) have
found a predominant if not exclusive use of soft measures. Relatively little,
if any, of the research made use of hard measures. An exception to this
trend is Armenakis et al.'s (1975) finding that nearly three-fourths of the
research surveyed used hard criterion measures. This finding is not
surprising since the evaluations selected for this particular review were
chosen from organizations that were primarily profit-oriented.

Instruments  and  Procedures  Used 


The use of subjective, attitudinal questionnaires is also common in
current OD evaluation research (Armenakis et al., 1975; Cummings et al.,
1977; De Meuse & Liebowitz, 1981; Pate et al., 1977; Porras & Berg, 1978a;
Terpstra, 1982; White & Mitchell, 1976). Armenakis et al. (1975) and Porras
and Berg (1978a) found that tailor-made questionnaires are frequently used
when available. De Meuse and Liebowitz (1981), Porras and Berg (1978a), and
Terpstra (1982) all found an absence of longitudinal, follow-up
measures.

Research  Design  and  Methodology 

White and Mitchell (1976) found that three-fourths of the research they
reviewed failed to use components necessary in order to establish cause and
effect (e.g., comparison groups). Armenakis et al. (1975) also found an
infrequent use of comparison groups. However, 44% of the studies reviewed by
Armenakis et al. did employ the time series design, which is thought to
control for many of the threats to internal validity (Campbell & Stanley,
1966). White and Mitchell seem to dismiss the time series design as a
possible means of establishing cause and effect. Of those studies employing
a comparison group, Armenakis et al. (1975) and Porras and Berg (1978a) found
a frequent use of the modified time series design when a nonequivalent
control group could be identified.

Recommendations


The literature reviews referred to above demonstrate some of the
discrepancies that exist between what might be considered the optimal
evaluation and the current state of the art of OD intervention evaluations. In
light of these discrepancies the following recommendations are warranted:

1) More adequate instruments for measuring change related to OD
intervention criteria should be developed.

2) In addition to "soft" criterion measures, "hard" criterion measures
should be used.

3) In order to determine the long-term effects and maintenance of a
program, multiple and longitudinal measurement should be carried out.

4) Where the selection of experimental and control groups on a random
basis is not possible, a comparison group, even an unmatched or
nonequivalent one, should be used.